Instructions

Before you start

The practical will be run using RStudio Cloud to avoid troubleshooting individual machines. Therefore, you need to have an RStudio Cloud account.

However, remember to install R (https://cran.r-project.org) and RStudio (http://www.rstudio.com/) on your own computer. That way, you can work on the project after class.

The goal.

This practical aims to implement what you have learned in class about Exploratory Data Analysis in R. You should have read chapter 18 of Crawley (2012), The R Book.

Learning objectives.

  1. Implement and interpret a Classification/Regression Trees analysis in R.
  2. Implement and analyse a Random Forest analysis in R.
  3. Implement and analyse a Boosted Regression Trees analysis in R.

The way it’s going to be run.

The 3h practical is divided into two parts. DO NOT COPY AND PASTE THE CODE! Write your own R code and run it - that is the best way to learn how the program works. Also, you have an excellent teaching assistant as a resource: ask him and/or me any questions you have.

The text in the blue box marked as Your task: states what you need to do.

# The code in the window marked like this shows where you need to write your code
#
# If I ask you to assess the result or answer a question, you should write the text here as a comment starting with two hashes (##)

Assessment.

Submit your knitted R-markdown file (the HTML) via BrightSpace - before the next practical! The assessment is Pass/Fail depending on how you write and annotate the code - you need to show us you know what you are doing and are NOT copying someone else’s code.

Supervised Classification of satellite images

Today you will implement a Classification/Regression algorithm to generate a model that can help you predict which “land-cover” class is the most likely to be observed given an observed spectral signature (the reflectance of the bands captured in a multispectral image). You will explore various supervised classification algorithms: Classification and Regression Trees (CART), Random Forest (RF), and Boosted Regression Trees (BRT). The practical’s goal is for you to realise that the choice of algorithm WILL affect the results. Also, today’s practical builds on the work done a few weeks back, where you built an unsupervised classification of a satellite image.

Refresher - Basics of Spatial Information (Loading/Plotting/Processing)

As for the classification exercise, you will need to load a series of raster images. For this, you will use the raster R package.

The first step is to load ALL THE BANDS in the Landsat image in the file Aarhus_2019_utm_L8.tif into R. For this, you will use the function stack to create a RasterStack.

Your task:

Load ALL the bands from the Aarhus_2019_utm_L8.tif, located in R-Studio cloud data [see the Files tab in the lower right window].

For this, you will need to:

  • Define an object named Dir.Loc with the string of the R-studio cloud location.

  • Use stack() to load the file and save it as an object named landsat8 - The file is a multi-band GeoTIFF.

  • Using the names() function, now change the names of the bands in landsat8 as follows: Band 1 = “Coastal”; Band 2 = “Blue”; Band 3 = “Green”; Band 4 = “Red”; Band 5 = “NIR”; Band 6 = “SWIR1”; Band 7 = “SWIR2”.

# Load the required library (raster)
library("raster")

# Use `stack()` to load the file and save it as an object named landsat8.
landsat8 <- stack("Data/Aarhus_2019_utm_L8.tif")

# Change the names of landsat8 using the names() function.
names(landsat8) <- c("Coastal", "blue", "green", "red", "NIR", "SWIR1", "SWIR2")

# Print the output of the landsat8 object.
landsat8
## class      : RasterStack 
## dimensions : 332, 493, 163676, 7  (nrow, ncol, ncell, nlayers)
## resolution : 30, 30  (x, y)
## extent     : 564270, 579060, 6219660, 6229620  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=32 +ellps=GRS80 +units=m +no_defs 
## names      :   Coastal,      blue,     green,       red,       NIR,     SWIR1,     SWIR2 
## min values :         0,         0,         0,         0,         0,         0,         0 
## max values : 0.3477450, 0.3674900, 0.4233425, 0.4276600, 0.5316375, 0.5013050, 0.4579925

Now that you have a RasterStack with the Coastal, blue, green, red, NIR, SWIR1 and SWIR2 bands (the names of bands 1 to 7), you will create a TRUE colour and a FALSE colour image for the area of interest.

Your task:

Plot both a true and false colour image of the region of interest using the plotRGB() function specifying in both a linear and histogram stretch.

# Use the plotRGB() function with a linear stretch to make the true colour composite - true colour composites follow a red, green, blue order.
plotRGB(landsat8[[c(4, 3, 2)]], # Select the red, green, and blue bands.
  r = 1, g = 2, b = 3, # Define which band is red, green, and blue.
  scale = 1, # Defines the maximum value a band can take (used for scaling when colouring).
  main = "Landsat True Color Composite\nlin stretch", # Add a title to the figure.
  stretch = "lin", # How to stretch the values to increase the contrast.
  margins = TRUE
)

# Use the plotRGB() function with a linear stretch to make the false colour composite - false colour composites follow a NIR, red, green order.
plotRGB(landsat8[[c(5, 4, 3)]], # Select the NIR, red, and green bands.
  r = 1, g = 2, b = 3, # Define which band is shown as red, green, and blue.
  scale = 1, # Defines the maximum value a band can take (used for scaling when colouring).
  main = "Landsat False Color Composite\nlin stretch", # Add a title to the figure.
  stretch = "lin", # How to stretch the values to increase the contrast.
  margins = TRUE
)

Based on the image above, you can have a good perspective on where you can observe particular land cover classes.

Supervised classification - the basics.

In supervised classification, you have prior knowledge about some of the features of interest (in this case, land-cover types). In the context of satellite images, this prior knowledge can come from, for example, fieldwork, reference spatial data or interpretation of high-resolution imagery (such as available on Google maps). Specific sites in the study area that represent homogeneous examples of these known land-cover types are identified. These areas are commonly referred to as training sites because the spectral properties of these sites are used to train the classification algorithm.

The next section of the practical will focus on:

  1. Generate sample sites based on a reference raster.
  2. Extract reflectance values from the Landsat image for the sample sites.
  3. Train the classifier using the training samples.
  4. Classify the Landsat data using the trained model.
  5. Evaluate the model’s accuracy by assessing how well it predicts another random sample of sites.

Reference data

As stated above, supervised classification is based on the principle that you have prior knowledge about some of the features of interest. In this case, you will use the BASEMAP classification for the year 2018 to identify the spectral signal of predefined land cover classes. BASEMAP is a 30 m multi-source land cover database with predictions for three epochs (2011, 2016 and 2018). The file BASEMAP-2018.tif has one set of class values and names that correspond to the levels of the land use and land cover classification (described in the Aggregated Legend.csv file).

Your task:

Load the BASEMAP data into R. As with the Landsat image used before, the BASEMAP data in the file BASEMAP-2018.tif is located in R-Studio cloud data [see the Files tab in the lower right window].

# Load the file and save it as an object named BASEMAP2018
BASEMAP2018 <- raster("Data/BASEMAP-2018.tif")

You can now plot the BASEMAP raster, an image representing the 2018 classification.

Your task:

Plot the Land Cover classification raster. Here, define a 34-value contrasting colour scheme using the hcl.colors() function.

# plot the `BASEMAP2018`
image(
  x = BASEMAP2018, # raster to plot.
  col = sample(hcl.colors(34, palette = "Zissou")) # Define the colour ramp (one colour per class).
)

However, it is impossible to determine the land-cover class for each grid cell based on this figure. For this, you will need to define the class names (the text name for each value in the raster). These are in the Aggregated Legend.csv file and are listed below:

Value Land Cover class
110000 Building
121000 Low built up
121110 Low built up; Building
122000 High built up
122110 High built up; Building
123000 City centre
123110 City centre; Building
124000 Other built up
124110 Other built up; Building
125000 Industry/business
125110 Industry/business; Building
126000 Airport/runway
126110 Airport/runway; Building
130000 Recreation area/sports ground
130110 Recreation area/sports ground; Building
141000 Road, paved
142000 Road, not paved
150000 Railway
150110 Railway; Building
160000 Resource extraction
211000 Agriculture, intensive, temporary crops
212000 Agriculture, intensive, permanent crops
220000 Agriculture, extensive
230000 Agriculture, not classified
311000 Forest
312000 Forest, wet
321000 Nature, dry
321220 Nature, dry; Agriculture, extensive
322000 Nature, wet
322220 Nature, wet; Agriculture, extensive
411000 Lake
412000 Stream
420000 Sea
800000 Unmapped

From the table above, it is clear that there are still many classes. However, if you read the table carefully, you can see that the first two (2) digits group the data in BASEMAP2018 into roughly eight (8) classes:

Value Land Cover class
11 & 12 Build
13 Recreation
14 & 15 Infrastructure
16 Mining
21 Agriculture
31 & 32 Nature
41 & 42 Aquatic
80 Unmapped
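A quick base-R illustration of this two-digit grouping: integer division by 10000 keeps only the first two digits of a BASEMAP code (the example codes below are taken from the legend table above; the actual reclassification you will do next uses reclassify()).

```r
# Integer division by 10000 recovers the first two digits of a BASEMAP code.
codes <- c(121110, 130110, 150000, 322220, 420000) # example codes from the legend above
codes %/% 10000
## [1] 12 13 15 32 42
```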

Your task:

Reclassify BASEMAP2018 into these seven categories (Unmapped becomes NA) using the reclassify() function.

# Create a table to reclassify
rclmat <- matrix(c(
  109999, 129999, 1, # Build
  129999, 139999, 2, # Recreation
  140000, 159999, 3, # Infrastructure
  160000, 169999, 4, # Mining
  200000, 290000, 5, # Agriculture
  300000, 390000, 6, # Nature
  400000, 499999, 7, # Aquatic
  500000, Inf, NA
), # Unmapped
ncol = 3, byrow = TRUE
)

# reclassify the values into seven groups
BASEMAP2018ReCls <- reclassify(
  BASEMAP2018,
  rclmat
)

# plot the reclassified raster
image(
  x = BASEMAP2018ReCls, # raster to plot.
  col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue") # One colour per class (1-7).
)

Now you will use the function ratify from the raster package to define the BASEMAP2018ReCls RasterLayer as a categorical image. I recommend that you take a step back and read more about what ratify does and how to use the function by using the command ?ratify.

Your task:

Make BASEMAP2018ReCls a categorical raster (one with a Raster Attribute Table) using the function ratify.

Then, plot this “categorical” raster using the function image and manually adding a legend for the land cover classes.

# Re-define the BASEMAP2018ReCls RasterLayer as a categorical variable using the ratify function.
BASEMAP2018ReCls2 <- ratify(x = BASEMAP2018ReCls)

## Now define the levels for each value in the ratified RasterLayer
# Create a data.frame with the Raster levels and the Names
rat <- data.frame(
  ID = c(1:7, NA), # The attributes values.
  landcover = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped") # The name-string for each class
)

# Attach the attributes lookup table data.frame (rat) as the levels of the ratified raster.
levels(BASEMAP2018ReCls2) <- rat


# Define the Plotting Space
par(mar = c(3, 3, 4, 7))

# Plot the ratified raster using the image() function, with one colour per land cover class.
image(BASEMAP2018ReCls2,
  main = "Land cover classification 2018", # Add a title.
  col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)

# Add a legend linking land cover class names to the colours.
legend("topright", # Define the position the legend.
  fill = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # Define the colours of each land cover class.
  legend = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped"), # Define each land cover class name.
  cex = 0.7, # Define the text size
  xpd = NA
) # Plot outside the Figure region

# Plot the ratified raster using the rasterVis package
rasterVis::levelplot(
  x = BASEMAP2018ReCls2, # Raster to plot.
  col.regions = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # One colour per land cover class (white for Unmapped).
  main = "Land Cover Classification" # Main Title
)

Generate a pool of sampled sites

Training and/or validation data can come from a variety of sources. In this example, you will generate the training and validation sample sites using the BASEMAP2018 reference RasterLayer. You will generate the sample sites following a stratified random sampling to ensure each Land-Use/Land-Cover class is sampled evenly.

Your task:

Use the function sampleStratified from the raster package to generate a stratified random sample from the cell values of BASEMAP2018ReCls.

As a reminder, a stratified random sample allows researchers to obtain a sample that best represents the entire population being studied. The function sampleStratified performs a proportional random sampling for each stratum (here, the land cover classes).

# Define  the training sites locations
# Set the random number generator to make the results reproducible.
set.seed(99)
# Sampling
samp2018 <- sampleStratified(
  x = BASEMAP2018ReCls, # Raster to be "sampled"
  size = 200, # Number of samples per class
  na.rm = TRUE, # NA values are removed from random sample?
  sp = TRUE # Output as a SpatialPointsDataFrame?
)

# What kind of object is samp2018?
class(samp2018)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Explore the data in samp2018 using the head() function.
head(samp2018)
##     cell layer
## 1  27046     1
## 2  68825     1
## 3  40305     1
## 4 110799     1
## 5 154563     1
## 6 155500     1

The stratified random sampling output is a SpatialPointsDataFrame object, which is R-speak for a point shapefile (when creating samp2018, setting the sp argument to FALSE would generate a data.frame instead).

When printing the samp2018 object, you can see there are two variables. First, the cell column contains the sampled cell numbers from BASEMAP2018ReCls. Second, the layer column contains the class values (1-7) corresponding to each land cover class.

Your task:

Before you go any further, plot the BASEMAP2018ReCls2 raster and add the samp2018 object over it.

# Plot the Land cover raster with one colour per land cover class.
image(BASEMAP2018ReCls2,
  main = "Land cover classification 2018", # Add a title.
  col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)

# Add the sampled points over the land cover raster using the points() function.
points(samp2018, # Define the SpatialPointsDataFrame to plot
  pch = 10, # Define the type of point.
  col = "red"
)

# Add a legend Showing the points as sampled locations
legend("bottomright", # Define the position the legend.
  pch = 10, # Define the point shape for sampled locations.
  col = "red",
  legend = "Sampled locations.", # Define How you are naming these sampled locations.
  cex = 0.7,
  xpd = NA
) # Plot outside the Figure region

# OR
# Plot the ratified raster using the rasterVis package and the points using latticeExtra
rasterVis::levelplot(
  x = BASEMAP2018ReCls2, # Raster to plot.
  col.regions = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # One colour per land cover class (white for Unmapped).
  main = "Land Cover Classification"
) +
  latticeExtra::layer(sp.points(samp2018, pch = 10, col = "red"))

As you can see from the figure above, you have sampled many locations across the BASEMAP2018ReCls land cover raster. But how can you be sure that each class has 200 random observations? For this, you can tabulate the results of your stratified random sampling using the function table() (look into ?table to see how the function works).

Your task:

Tabulate the results from your stratified random sampling using the function table (look into ?table to know how the function works)

# Extract the land cover classes identities for each plot sampled in the stratified random sample.
SampLCC <- samp2018$layer

# Use the table() function to tabulate the number of land cover classes sampled by the `sampleStratified` function
table(SampLCC)
## SampLCC
##   1   2   3   5   6   7 
## 200 200 200 200 200 200

The table above clearly shows that each sampled land cover class has been sampled precisely 200 times. Note that class 4 (Mining) is absent, apparently because the reclassified raster contains no Mining cells in this area.

Extract values for sites

Here you will go back to the Landsat data used at the beginning of the tutorial (landsat8). As you now have a series of sites with an associated land cover class, you can extract the cell values from the landsat8 RasterStack. These band values will be the predictor variables, and the “class values” from BASEMAP2018ReCls will be the response variable. To extract these, you will use the function extract.

Your task:

Use the function extract to extract, for each randomly selected site per land cover class, the spectral band values in landsat8.

# Extract the values from `landsat8`
sampvals <- extract(
  x = landsat8, # Define the raster to be sampled.
  y = samp2018, # Define the spatial vector data with the locations to sample.
  df = TRUE # Return the results as a data.frame?
)

# sampvals no longer has the spatial information! To keep the spatial information, use the `sp = TRUE` argument in the `extract` function.

# Now some house keeping!!
# drop the ID column
sampvals <- sampvals[, -1]
# combine the class information with extracted values
sampdata <- data.frame(
  classvalue = samp2018@data$layer, # Get the Landcover "class" for each selected location.
  sampvals # The data.frame with the spectral bands in `landsat8`.
)
# Explore the resulting data.frame using the head() function.
head(sampdata)
##   classvalue    Coastal       blue      green        red       NIR     SWIR1      SWIR2
## 1          1 0.05020875 0.05723500 0.07900125 0.08249375 0.2067525 0.1701500 0.12917501
## 2          1 0.01758000 0.02712250 0.05896750 0.05049750 0.2681875 0.1557675 0.09548750
## 3          1 0.03716000 0.04480500 0.07153500 0.06622750 0.3369650 0.1763100 0.09565250
## 4          1 0.03150875 0.04237125 0.05815625 0.06081000 0.0935900 0.1300963 0.11790000
## 5          1 0.02963875 0.03431375 0.05477375 0.04809125 0.2436300 0.1471050 0.09195375
## 6          1 0.02778250 0.03355750 0.06048000 0.05294500 0.2748700 0.1597000 0.10382000

Train the RF classifier

Now you will train the RF classification algorithm using the sampdata dataset. Here, land cover class is the response variable (set up as a factor, as the classes are coded as numbers in the raster), and the ‘Coastal’, ‘blue’, ‘green’, ‘red’, ‘NIR’, ‘SWIR1’ and ‘SWIR2’ bands are the predictors.

The first question to ask is: why not use a single classification tree? Well, have you ever wondered why we ask multiple people for their opinions or reviews before going to a film, buying a car, or planning a holiday? It is because the review of one person may be biased by her/his preferences. When you ask multiple people, you try to remove the bias a single person may introduce: one person may strongly dislike a specific destination because of a bad experience there, while ten other people may strongly prefer the same destination because they had a wonderful experience.

In the world of analytics and data science, this is called ‘ensembling’. Ensembling is a type of supervised learning technique in which multiple models are trained on a training dataset and their outputs are combined by some rule to derive the final consensus output. This is the benefit of random forests (RF) compared to single regression trees.

Random Forest is a very powerful ensemble classification algorithm that works by creating multiple decision trees and then combining the outputs generated by each tree. Decision trees are the classification models you applied in the section above; they work on the concept of information gain at every node. At each node, the tree tries to split the data points and checks the information gain of each candidate split, splitting where the information gain is maximal. It follows this process until all the nodes are exhausted or there is no further information gain. Decision trees are easy-to-understand models, but they have low predictive power; they are called weak learners. Random Forest builds on the same weak learners, but combines them when it finally produces its output.

Another point that sets Random Forest apart is that it does not use all the data points and variables in each of the trees. It randomly samples data points and variables for each of the trees it creates and then combines the outputs at the end - this approach is called bagging. By doing this, Random Forest removes the bias that a single decision tree might introduce, and it improves the predictive power significantly.
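To make the bagging idea concrete, here is a minimal base-R sketch, not the actual randomForest implementation: a weak one-variable "stump" classifier is fitted to many bootstrap samples of a toy two-class dataset, and the individual predictions are combined by majority vote. All names (fit_stump, predict_stump) are made up for this illustration.

```r
set.seed(1)

# Toy data: two classes separated (noisily) along a single variable x.
n <- 200
x <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2))
y <- rep(c("A", "B"), each = n / 2)

# A "weak learner": pick the threshold on x that best separates the classes.
fit_stump <- function(x, y) {
  cand <- sort(unique(x))
  acc <- sapply(cand, function(cc) mean((x > cc) == (y == "B")))
  cand[which.max(acc)]
}
predict_stump <- function(cut, x) ifelse(x > cut, "B", "A")

# Bagging step 1: fit the weak learner to B bootstrap samples of the data.
B <- 25
cuts <- replicate(B, {
  i <- sample(n, replace = TRUE) # bootstrap rows (sampling with replacement)
  fit_stump(x[i], y[i])
})

# Bagging step 2: combine the B sets of predictions by majority vote.
votes <- sapply(cuts, function(cc) predict_stump(cc, x)) # n x B matrix of votes
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == y) # training accuracy of the ensemble
```

Random Forest adds one more ingredient on top of this: each tree also sees only a random subset of the predictor variables (the mtry argument).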

Your task:

Using the function randomForest() from the randomForest package, build a random forest model that predicts land cover classes based on the spectral signal of sampled sites.

# Load the required package (randomForest)
library(randomForest)

# Build a random forest model that predicts land cover classes based on the spectral signal of sampled sites.
# Your response variable should be a factor.
RF.model <- randomForest(
  formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
  data = sampdata, # Define the data frame containing the variables in the model.
  ntree = 500, # Define the number of trees to grow. By default, the function builds 500 trees.
  mtry = 2, # Define the Number of variables randomly sampled as candidates at each split. By default, the function uses two (2) predictors per tree.
  importance = TRUE # Should the importance of predictors be assessed?
)
# Print the randomForest just created
RF.model
## 
## Call:
##  randomForest(formula = as.factor(classvalue) ~ ., data = sampdata,      ntree = 500, mtry = 2, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 43.58%
## Confusion matrix:
##    1  2  3   5   6   7 class.error
## 1 55 33 76  24  10   2       0.725
## 2 29 89 29  27  26   0       0.555
## 3 70 20 77  20  12   1       0.615
## 5 11 16 13 138  22   0       0.310
## 6  7 25  8  29 130   1       0.350
## 7  5  3  2   0   2 188       0.060

Based on the output above, you can see that the OOB error rate is ~43%, which is relatively high. To improve this, you might need to build more trees and/or use more predictors per split - that is, to tune the random forest model.

You can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each stage (mtry). According to the randomForest package description:

  • ntree: Number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times.

  • mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is the number of variables in x) and regression (p/3).
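For the seven predictor bands used in this practical, those package defaults (rounded down, as randomForest does internally) work out as follows:

```r
p <- 7                # number of predictor bands in landsat8
floor(sqrt(p))        # default mtry for classification: 2
max(floor(p / 3), 1)  # default mtry for regression: 2
```

This is why the models above use mtry = 2 by default.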

You can explore how a change in the number of trees affects the Out-of-bag (OOB) error. You will now build a loop that evaluates how much the error changes as the number of trees used in the forest increases.

Your task:

Explore how the change in the number of trees affects the Out-of-bag (OOB) error.

For this, build a loop (using a for() loop) that evaluates how much the error changes as the number of trees used in the model (defined with the ntree argument) varies among 100, 500, 1000, 2000 and 5000.

# Using a loop, evaluate how much the Out-of-bag (OOB) error changes as the number of trees shifts among 100, 500, 1000, 2000, and 5000

# Create a summary table
RF.Error.ntree <- data.frame(
  ntree = c(100, 500, 1000, 2000, 5000), # A vector with the number of trees
  OOB = NA # Set the OOB to NA - will be filled for each number of trees below
)

# Loop over RF.Error.ntree$ntree to assess the impact of tree number on OOB-error.
for (i in RF.Error.ntree$ntree) {
  # Set the random seed for repeatability
  set.seed(99)
  # Estimate the randomForest for the evaluated Number of trees
  RF.mode.tmp <- randomForest(
    formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
    data = sampdata, # Define the data frame containing the variables in the model.
    ntree = i, # Define the number of trees to grow.
    mtry = 2, # Define the Number of variables randomly sampled at each split. Use the Default = 2.
    importance = TRUE # Should importance of predictors be assessed?
  )
  # Add the estimated OOB-error to `RF.Error.ntree` - the summary data.frame
  RF.Error.ntree$OOB[RF.Error.ntree$ntree == i] <- RF.mode.tmp$err.rate[dim(RF.mode.tmp$err.rate)[1]]
}
# Print the summary data.frame
RF.Error.ntree
##   ntree       OOB
## 1   100 0.4508333
## 2   500 0.4383333
## 3  1000 0.4425000
## 4  2000 0.4408333
## 5  5000 0.4450000

The change in the Out-of-bag (OOB) error shows slightly worse models as the number of trees increases beyond 500. The question then is how the OOB error changes as a function of the number of predictors used per tree.

Your task:

Explore how the change in the number of predictors used per tree affects the Out-of-bag (OOB) error.

For this, build a loop (using a for() loop) that evaluates how much the error changes as the number of variables randomly sampled as candidates at each split (defined with the mtry argument) varies from 1 to 7.

# Using a loop, evaluate how much the Out-of-bag (OOB) error changes as the number of variables per tree increases.
# Create a summary table
RF.Error.mtry <- data.frame(
  mtry = 1:7, # A vector with the number of variables
  OOB = NA # Set the OOB to NA - will be filled for each number of variables below
)
# Loop over RF.Error.mtry$mtry to assess the impact of tree number on OOB-error.
for (i in RF.Error.mtry$mtry) {
  # Set the random seed for repeatability
  set.seed(99)
  # Estimate the randomForest for the evaluated Number of variables
  RF.mode.tmp <- randomForest(
    formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
    data = sampdata, # Define the data frame containing the variables in the model.
    ntree = 500, # Define the number of trees to grow. Use the Default = 500
    mtry = i, # Define the Number of variables randomly sampled at each split.
    importance = TRUE # Should importance of predictors be assessed?
  )
  # Add the estimated OOB-error to the summary data.frame
  RF.Error.mtry$OOB[RF.Error.mtry$mtry == i] <- RF.mode.tmp$err.rate[dim(RF.mode.tmp$err.rate)[1]]
}
# Print the summary data.frame
RF.Error.mtry
##   mtry       OOB
## 1    1 0.4450000
## 2    2 0.4383333
## 3    3 0.4325000
## 4    4 0.4275000
## 5    5 0.4341667
## 6    6 0.4291667
## 7    7 0.4383333

The Out-of-bag (OOB) error reaches a minimum when four (4) variables are used to build each tree. This indicates that changing the default values of the randomForest function would produce a better model for this training dataset; however, you will stick with the defaults, as the changes in OOB error are small.

Although the Out-of-bag (OOB) error values might seem large, what they show is that the average predictive ability of any single tree is poor. This is also reflected in the confusion matrix produced when printing RF.model. For some insights into confusion matrices, check out: https://bit.ly/3ltpkyk.
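As a check on the printed RF.model output above, the OOB error rate can be recomputed directly from the confusion matrix: the diagonal holds the correctly classified samples, so the error is one minus the diagonal sum divided by the total.

```r
# Rebuild the OOB confusion matrix printed above (rows = observed, columns = predicted).
cm <- matrix(c(
  55, 33, 76,  24,  10,   2,
  29, 89, 29,  27,  26,   0,
  70, 20, 77,  20,  12,   1,
  11, 16, 13, 138,  22,   0,
   7, 25,  8,  29, 130,   1,
   5,  3,  2,   0,   2, 188
), nrow = 6, byrow = TRUE)

# Overall OOB error = 1 - (correct / total); matches the ~43.58% printed above.
1 - sum(diag(cm)) / sum(cm)
## [1] 0.4358333
```

The per-class class.error column printed with the model is the same idea applied row by row: one minus the diagonal entry divided by the row total.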

However, it is essential to note that the model predicts the training dataset perfectly. To see this, you need to tabulate the predicted and observed classes and assess the level of misclassification (the difference between the observed and predicted values).

Your task:

Tabulate the observed vs. predicted classification and determine the level of misclassification when the model predicts the training dataset.

# predict the values for the training dataset using the predict function
predRF <- predict(
  object = RF.model, # Define the randomForest with the model.
  newdata = sampdata, # Define data.frame used to make the prediction - here is the training data.
  type = "response" # indicating the type of output:
)
# Tabulate the observed vs. predicted classification - this is a type of confusion matrix!
table(predRF, sampdata$classvalue)
##       
## predRF   1   2   3   5   6   7
##      1 200   0   0   0   0   0
##      2   0 200   0   0   0   0
##      3   0   0 200   0   0   0
##      5   0   0   0 200   0   0
##      6   0   0   0   0 200   0
##      7   0   0   0   0   0 200

The table above shows that there is zero misclassification when predicting the training dataset. However, to properly assess how good the model is, you need to evaluate it on a “testing” dataset that is different from (and preferably independent of) the “training” dataset.

The last question to ask is how important a given variable is for discriminating between land cover classes. As you set the importance argument when building your Random Forest model, you can extract this information using the importance() and varImpPlot() functions from the randomForest package.

Your task:

Assess the importance of the used predictors to discriminate between land Cover Classes.

# Extract variable importance measures for the best attribute configuration of your randomForest model using the importance() function.
importance(RF.model)
##                 1         2         3        5         6         7 MeanDecreaseAccuracy MeanDecreaseGini
## Coastal 0.7357383 28.709203  2.381668 26.62737  9.211609  4.954223             29.99518         118.4440
## blue    6.6011793 13.367544 13.746580 23.91673 21.396853  4.693822             33.06819         120.5938
## green   0.3892797 15.015633  8.848622 23.51980 33.176215  5.212055             33.26567         144.4860
## red     7.8559635  5.773528 21.561917 24.75848 20.633487  7.801037             31.97582         135.3020
## NIR     6.1042268 30.459215 16.922784 41.03353 18.477913 15.429188             31.37197         173.6557
## SWIR1   3.7179664 21.713725 13.306461 30.50167 19.520763 16.638204             25.67379         151.9236
## SWIR2   6.1146173 14.338691 19.880305 28.56497 34.121498 16.259274             30.41735         154.7797
# Plot the variable importance measures for the best attribute configuration of your randomForest model using the varImpPlot() function

varImpPlot(RF.model)

Based on the MeanDecreaseAccuracy, the most important variable is green and the least important is SWIR1. In the case of MeanDecreaseGini, the relative change in node purity is largest for NIR and smallest for Coastal.

Model evaluation - K-fold validation

Now that you have a classification model, you can evaluate its prediction accuracy using a k-fold validation approach. In this technique, the data used to fit the model are split into k groups (typically five). In turn, one group is used for model testing while the remaining data are used for model training (fitting). For this, you will use the kfold function of the dismo package.
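The fold assignment behind k-fold validation can be sketched in base R; this toy version simply shuffles a balanced set of fold labels (the practical itself uses kfold() from dismo, which additionally supports stratifying by class):

```r
set.seed(99) # reproducibility

n <- 20                                   # say, 20 observations
folds <- sample(rep(1:5, length.out = n)) # a shuffled fold label (1-5) per observation
table(folds)                              # 4 observations in each of the 5 folds
```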

Your task:

Do a five-fold (k = 5) partitioning of the training dataset, stratifying the folds by the land cover class

# Load the required library (dismo)
library(dismo)

# set the random seed for reproducibility
set.seed(99)

# Do a k-fold partitioning of the training dataset
k.foldDta <- kfold(sampdata, # the source data to use (the sample)
  k = 5, # number of Folds (here you use 5)
  by = sampdata$classvalue
) # define how to stratify the folds

# Tabulate the k-fold partitioning dataset by the Land Cover Class to see the number of samples per land cover class to be used in each fold.
table(k.foldDta, sampdata$classvalue)
##          
## k.foldDta  1  2  3  5  6  7
##         1 40 40 40 40 40 40
##         2 40 40 40 40 40 40
##         3 40 40 40 40 40 40
##         4 40 40 40 40 40 40
##         5 40 40 40 40 40 40

With the k-fold partitioning, you can now train and test your Random Forest models five times - five because you have five-folds!
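For reference, what kfold() does can be sketched in base R: within each class, fold labels 1..k are dealt out and shuffled, so every fold is stratified by class (a toy class vector stands in for sampdata$classvalue here):

```r
# Stratified k-fold assignment sketched with base R only.
set.seed(99)
k <- 5
classes <- rep(1:6, each = 200) # toy stand-in for sampdata$classvalue
# Within each class, deal out fold labels 1..k and shuffle them
folds <- ave(seq_along(classes), classes,
             FUN = function(idx) sample(rep(1:k, length.out = length(idx))))
table(folds, classes) # every class contributes equally to every fold
```

Because the labels are assigned within each class, every fold receives the same number of observations per class, which is exactly the balanced table you obtained above.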

Your task:

Train (using those points NOT in the k-fold dataset) and test (using those points IN the k-fold dataset) your Random Forest model.

For each of the five (5) folds, compute a confusion matrix and store it as a component in a list named RF.Kfold.List.

Finish by estimating the average confusion matrix over the five folds.

# Kfold Validation for the RF model
RF.Kfold.List <- lapply(1:5, function(k) {
  # Define the Training data.frame - Observations NOT selected in the k-fold.
  train <- sampdata[k.foldDta != k, ]
  # Define the test data.frame - Observations selected in the k-fold.
  test <- sampdata[k.foldDta == k, ]
  # build a new model using the Train dataset
  RF.model.kfold.Mod <- randomForest(
    formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
    data = train, # Define the data frame containing the variables in the model.
    ntree = 500, # Define the number of trees to grow. Use the Default = 500
    mtry = 3, # Define the number of variables randomly sampled at each split - use the value that performed best when tuning RF.model (3 is a placeholder here).
    importance = TRUE # Should importance of predictors be assessed?
  )
  # Predict the land cover class of the test data using the k-fold model
  pclass <- predict(
    object = RF.model.kfold.Mod, # Define the randomForest model used to predict.
    newdata = test, # Define the data.frame used to make the prediction - here the test data.
    type = "response" # indicating the type of output
  )
  # create a matrix combining the reference (observed) and predicted classes
  cbind(
    Observed = test$classvalue,
    Predicted = as.integer(as.character(pclass)) # convert the factor levels back to numeric class values
  )
})

# Confusion matrix for RF model the kfold validation
# Merge the tables in the list row wise
RF.Kfold <- do.call(
  "rbind",
  RF.Kfold.List
)
# Make the matrix a data.frame
RF.Kfold <- data.frame(RF.Kfold)

# Set the column names for the data.frame.
colnames(RF.Kfold) <- c("observed", "predicted")

# Tabulate the (miss)matches between the observed  and predicted test datasets
RF.conmat <- table(RF.Kfold)

# Change the name of the classes in the tabulation.
colnames(RF.conmat) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
rownames(RF.conmat) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
# Print the confusion matrix
RF.conmat
##                 predicted
## observed         Build Recreation Infrastructure Agriculture Nature Aquatic
##   Build            182        164            186         193     72       3
##   Recreation       109        160            106          71    165     189
##   Infrastructure   101         68             98         180    165     188
##   Agriculture      182        164            186         193     72       3
##   Nature           109        160            106          71    165     189
##   Aquatic          101         68             98         180    165     188

Your task:

Estimate the misclassification rate (proportion of observations wrongly classified by the model) for the Random Forest model.

### Miss classification of the RF model
RF.class.error <- sapply(
  colnames(RF.conmat),
  function(i) {
    sum(RF.conmat[i, !colnames(RF.conmat) %in% i])
  }
) / apply(RF.conmat, 1, sum)
# Print the miss-classification of the RF model
RF.class.error
##          Build     Recreation Infrastructure    Agriculture         Nature        Aquatic 
##        0.77250        0.80000        0.87750        0.75875        0.79375        0.76500
# Make a barplot of the miss-classification of the RF model
barplot(RF.class.error,
  main = " RF model k-fold miss-classification rate",
  ylim = c(0, 1)
)

Based on this approach, the misclassification rate is similarly high across all classes for the RF model.
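As an aside, the sapply()/apply() construction used above is equivalent to a base-R one-liner using diag() and rowSums(); a toy confusion matrix (made up for illustration) shows this:

```r
# Per-class misclassification rate as one line: for each observed class
# (row), one minus the diagonal entry divided by the row total.
cm <- matrix(c(50, 10,  5,
                8, 60,  2,
                4,  6, 55),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("A", "B", "C"),
                             predicted = c("A", "B", "C")))
class.error <- 1 - diag(cm) / rowSums(cm)
class.error
```

The off-diagonal row sum divided by the row total and one minus the per-class accuracy are the same quantity; the one-liner is simply easier to read.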

Model evaluation - Overall accuracy and kappa.

You can also assess the accuracy using a subset of the data - and this is a preferred option and the core of cross-validation. When doing this model test, you get an idea of how accurate the classified map might be. Two widely used measures in remote sensing are “overall accuracy” and “kappa”.

Overall accuracy is the probability that a test will correctly classify an individual; that is, the sum of the true positives plus true negatives divided by the number of individuals tested. In other words, how often do you predict a correct classification?

Your task:

Estimate the overall accuracy (proportion of cases correctly classified) for the Random Forest model.

# Overall accuracy estimation for the RF Model
## number of cases
RF.n <- sum(RF.conmat)
RF.n
## [1] 4800
# number of correctly classified cases per class in the RF model
RF.diag <- diag(RF.conmat)
# Overall Accuracy of the RF model
RF.OA <- sum(RF.diag) / RF.n
RF.OA
## [1] 0.2054167

By comparison, Cohen’s “kappa” coefficient measures the agreement between two raters (here, the observed and predicted classifications) that each classify \(N\) items into \(C\) mutually exclusive categories.

A simple way to think of this is that Cohen’s Kappa is a quantitative measure of reliability for two raters rating the same thing, corrected for how often the raters may agree by chance. It is generally thought to be a more robust measure than a simple per cent agreement calculation, as it considers the possibility of the agreement occurring by chance. For extra information on the index, see https://bit.ly/3Ahju9q.

Your task:

Estimate the Cohen’s “kappa” coefficient (overall accuracy corrected by the expected accuracy) for the Random Forest model.

# observed (true) cases per class
## Estimation for the RF model

# Determine the total number of observations OBSERVED for a given land cover class (the row sum)
RF.rowsums <- apply(RF.conmat, 1, sum)
# Define the probability of each OBSERVED land cover class (total-OBSERVED / total number of cases)
RF.p <- RF.rowsums / RF.n

# Determine the total number of observations PREDICTED for a given land cover class (the column sum)
RF.colsums <- apply(RF.conmat, 2, sum)
# Define the probability of each PREDICTED land cover class (total-PREDICTED / total number of cases)
RF.q <- RF.colsums / RF.n

# Estimate the Expected Accuracy - the sum over classes of the product of the OBSERVED and PREDICTED class probabilities.
RF.expAccuracy <- sum(RF.p * RF.q)

# Estimate the Kappa - overall accuracy MINUS the expected accuracy, divided by one MINUS the expected accuracy
RF.kappa <- (RF.OA - RF.expAccuracy) / (1 - RF.expAccuracy)

# Print
RF.kappa
## [1] 0.0465
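The steps above can be collected into a small helper that computes kappa from any square confusion matrix (the toy matrix below is only for illustration):

```r
# Cohen's kappa from a square confusion matrix: observed accuracy
# corrected by the accuracy expected from the marginal class frequencies.
cohen.kappa <- function(cm) {
  n  <- sum(cm)
  oa <- sum(diag(cm)) / n                          # observed (overall) accuracy
  ea <- sum((rowSums(cm) / n) * (colSums(cm) / n)) # accuracy expected by chance
  (oa - ea) / (1 - ea)
}

cm <- matrix(c(20,  5,
               10, 15), nrow = 2, byrow = TRUE)
cohen.kappa(cm) # 0.4 for this toy matrix
```

A matrix with all counts on the diagonal returns a kappa of 1 (perfect agreement), while a matrix whose diagonal matches what the marginals predict by chance returns 0.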

Model evaluation - Producer’s and User’s Accuracy

The Producer’s Accuracy is the map accuracy from the point of view of the mapmaker (the producer). This is how often are real features on the ground correctly shown on the classified map or the probability that a particular land cover of an area on the ground is classified as such. The Producer’s Accuracy is the number of reference sites classified accurately divided by the total number of reference sites for that class.

The User’s Accuracy is the accuracy from the point of view of a map user. The User’s accuracy essentially tells you how often the class on the map will be present on the ground. This is referred to as reliability. The User’s Accuracy is calculated by taking the number of correct classifications for a particular class and dividing it by the total number of cells predicted as that class (the column total in a confusion matrix with the observed classes in the rows).

Your task:

Estimate the producer accuracy (number of reference sites classified accurately divided by the total number of reference sites) for the Random Forest model.

# Producer Accuracy - The RF case [correct classification per class / number of reference sites for that class (the row total)]
RF.PA <- RF.diag / RF.rowsums
# The overall PA is the mean of the per-class proportions of correct classifications
mean(RF.PA)
## [1] 0.2054167

Your task:

Estimate the user accuracy (number of correct classifications for a particular class divided by the total number of cells predicted as that class) for the Random Forest model.

# User accuracy - The RF case [correct classification per class / number of cells predicted as that class (the column total)]
RF.UA <- RF.diag / RF.colsums
# The overall UA is the mean of the per-class proportions of correct classifications
mean(RF.UA)
## [1] 0.2053
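With the observed classes in the rows and the predicted classes in the columns, the producer’s and user’s accuracy reduce to two one-liners (toy matrix, made up for illustration):

```r
# Producer's and user's accuracy from a confusion matrix with
# observed classes in the rows and predicted classes in the columns.
cm <- matrix(c(50, 10,  5,
                8, 60,  2,
                4,  6, 55),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("A", "B", "C"),
                             predicted = c("A", "B", "C")))
PA <- diag(cm) / rowSums(cm) # producer's accuracy: correct / reference (row) total
UA <- diag(cm) / colSums(cm) # user's accuracy: correct / predicted (column) total
PA
UA
```

The two measures differ only in the marginal used as the denominator, which is why getting the orientation of the confusion matrix right matters.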

Global Model evaluation

Now that you have trained your classification model, you can use it to make a prediction - even if these might not be the best models! For this, you will use the RF model to classify all cells in landsat8. It is important that the layer names in the RasterStack object to be classified exactly match those used as predictors in the model. This is the case if the same Raster object was used (via extract) to obtain the values used to fit the model.

Your task:

Use your Random Forest model to generate predicted land cover maps for the region of interest (landsat8).

# Use the predict function (as implemented for raster objects) and the Random Forest model to generate predicted land cover classes for 2018.
RF.pr2018 <- predict(
  object = landsat8, # Define the raster used to make the predictions.
  model = RF.model, # Define the model used to make the predictions..
  type = "response" # Define the type of predicted value returned
)
# Plot the RF model PREDICTED map using the image() function and the HEX colours in classdf - remember that there is no class 4 (Mining) in classdf, and so it does not have a colour.
image(RF.pr2018,
  main = "RF Predicted Land cover classification 2018", # Give the Name
  col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)

# Add a legend linking land cover class names to the colours.
legend("topright", # Define the position the legend.
  # inset = c(-0.24,0), # Define inset distance(s) from the margins.
  fill = c("grey", "light green", "black", "red", "orange", "dark green", "blue", "white"), # Define the colours of each land cover class.
  legend = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped"), # Define each land cover class name.
  cex = 0.7, # Define the text size
  xpd = NA
) # Plot outside the Figure region

By plotting and counting the misclassified cells (Observed != Predicted), you can get an overview of the classification error.

Your task:

Show the misclassified cells (colour them RED) and those classified correctly (colour them BLUE) by the Random Forest model.

Also, estimate the proportion of all the cells in landsat8 that are misclassified cells by the Random Forest model.

# Use the image function to show the cells that are misclassified (Predicted != Observed) and classified correctly (Predicted == Observed) based on the Random forest model
image(RF.pr2018 == BASEMAP2018ReCls,
  main = "RF misclassified cells",
  col = c("red", "blue")
)

# Estimate the proportion of misclassified cells (Predicted != Observed)/Ncells for the RF model.
sum((RF.pr2018 != BASEMAP2018ReCls)[], na.rm = T) / ncell(BASEMAP2018ReCls)
## [1] 0.4135609
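A caveat worth noting: dividing by ncell() keeps the NA (unmapped) cells in the denominator, which implicitly treats them as correctly classified. Restricting the rate to mapped cells gives a different value; sketched here with plain vectors standing in for the raster values:

```r
# Misclassification rate with and without NA cells in the denominator
# (toy vectors standing in for BASEMAP2018ReCls[] and RF.pr2018[]).
obs  <- c(1, 2, 3, 1, NA, 2)
pred <- c(1, 3, 3, 1,  2, 2)
sum(obs != pred, na.rm = TRUE) / length(obs) # NA cells count in the denominator
mean(obs != pred, na.rm = TRUE)              # mapped (non-NA) cells only
```

Which denominator is appropriate depends on whether you want the error rate over the whole raster extent or only over the mapped area.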

Although visual inspections and counts of misclassified cells show you the quality of the classification, you also want to assess the per-class error rate (the proportion of observations in a class that the algorithm misclassifies).

Your task:

Estimate the confusion matrix for the rasters predicted using the Random forest model.

## Use the table() function to estimate the confusion matrix for the RF model
RF.ConfMtx <- table(
  BASEMAP2018ReCls[],
  RF.pr2018[]
)
rownames(RF.ConfMtx) <- colnames(RF.ConfMtx) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
RF.ConfMtx
##                 
##                  Build Recreation Infrastructure Agriculture Nature Aquatic
##   Build          15377       6051          15295        4458   2219     614
##   Recreation      1440       3575           1108        1361    986      23
##   Infrastructure  7301       2630          11421        2501   1487     236
##   Agriculture     2334       2588           2210       18946   4139      15
##   Nature           788       2073            617        3226  12207     148
##   Aquatic          330        148            408         209    747   29034
# Define the per-class error rate [the proportion of the cells of a given land cover class that are misclassified]
RFclass.error <- sapply(
  colnames(RF.ConfMtx),
  function(i) {
    sum(RF.ConfMtx[i, !colnames(RF.ConfMtx) %in% i])
  }
) / apply(RF.ConfMtx, 1, sum)
RFclass.error
##          Build     Recreation Infrastructure    Agriculture         Nature        Aquatic 
##     0.65063389     0.57906511     0.55344855     0.37331305     0.35951519     0.05965799

Based on this, it is clear that your model is very good at discriminating areas classified as Aquatic, but does a poor job with areas classified as Build.

Final points

Supervised classification is the technique most often used for the quantitative analysis of remote sensing image data. At its core is the concept of segmenting the spectral domain into regions that can be associated with the ground cover classes of interest to a particular application.

For example, you could use the PCA axes instead of the raw data in the Landsat image. Alternatively, you could use more points to develop your model (right now, you use 200 points, which is not even 1% of the available data - more data gives you a better model but requires more computer power!!). I leave these tasks to the interested student! This exercise gave you a guide to the procedure, and although your final model is not the best, it is a starting point for developing this type of analysis.